multimodal video model
Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding
With the emergence of large pre-trained multimodal video models, multiple benchmarks have been proposed to evaluate model capabilities. However, most of these benchmarks are human-centric, with evaluation data and tasks centered around human applications. Animals are an integral part of the natural world, and animal-centric video understanding is crucial for animal welfare and conservation efforts. Yet existing benchmarks overlook animal-focused evaluation, limiting the applicability of the models. To address this limitation, we establish an animal-centric benchmark, Animal-Bench, enabling a comprehensive evaluation of model capabilities in real-world contexts and overcoming the agent bias of previous benchmarks.
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
The Perception Test introduces 11.6k real-world videos (23s average length) designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), along with a challenge server hosting a held-out test split.
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating models' temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it supports evaluation on diverse tasks (both video question answering and captioning, both short and long video understanding) and on different model types, including multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question-answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we observe a critical pitfall of multi-choice QA: LLMs can detect the subtle changes in negative captions and exploit a centralized description as a cue for prediction. To correct this bias, we propose Multiple Binary Accuracy (MBA). We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and evaluation code will be made available.
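The abstract names Multiple Binary Accuracy (MBA) without spelling out the scoring rule. A minimal sketch, assuming MBA decomposes each multi-choice question into pairwise binary choices (positive caption vs. each negative) and counts the question correct only when every binary pair is answered correctly; the function and variable names below are illustrative, not from the paper's codebase:

```python
def multiple_binary_accuracy(predictions):
    """Compute a sketch of MBA.

    predictions: list of questions; each question is a list of booleans,
    one per (positive, negative) binary pair, True if the model
    preferred the positive caption in that pairwise comparison.
    A question scores 1 only if ALL of its binary pairs are correct.
    """
    correct = sum(1 for pairs in predictions if all(pairs))
    return correct / len(predictions)

# Question 1: all three binary pairs right; question 2: one pair missed.
score = multiple_binary_accuracy([[True, True, True], [True, False, True]])
print(score)  # 0.5
```

Requiring all pairs to be correct removes the shortcut the abstract describes: a model can no longer score by spotting which option differs subtly from a centralized description, since each negative must be rejected on its own.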